Team Illinois Members
Julia Dawson, NETID: juliaed3
Joseph Stewart, NETID: jstew7
Saumya Singh, NETID: saumyas2
The purpose of this project is to attempt to build models which can be used to accurately predict premature death based on various health and socio-economic factors. The data for this project were obtained from https://www.countyhealthrankings.org/. County Health Rankings ranks each county within each state based on a variety of health measures obtained from national and state data sources. One of the many attributes contained in the raw data is the “Years of Potential Life Lost Rate” (YPLLRate) which represents premature death for the a given geographical region. The YPLLRate value is the number of years lost per 100,000 people based on the various health and socio-economic factors present in the data. This project seeks to model the interaction between the YPLLRate serving as the response variable and the other attributes serving as the predictor variables.
The raw data for this project were obtained from https://www.countyhealthrankings.org/. County Health Rankings obtained this data from various nation-wide and state data sources. The raw data consisted of over 240 variables and the preliminary cleaning reduced this number to 96 by removing redundancies and sparsely populated variables.
The redundant variables included many examples of providing a quantity and a percentage of some factor. For example, there were variables for the number of smokers in a region as well as the percentage of the population that smoked. The percentage variables were chosen over the quantity variables and the quantity variable were omitted There were also many race-specific variables that had N/A values in too many observations to be useful as predictors. These variables were omitted as well.
The remaining predictor variables are made up of categorical geographic data such as state and county, health related numeric variables such as percentage of low birth weight, percentage of smokers, and percentage of adults with diabetes, and socio-economic numeric variables such as percentage of single parent households, percentage of people with some college, and percentage of people unemployed.
The objective of this project is to identify which combination of the remaining predictors most-contributes to premature death.
The large number of predictors in the cleaned data set made model selection very difficult. No single approach seemed to emerge as the best route so we opted to pursue several approaches and compare the results. Model-selection algorithm proved to be computationally infeasible in that it ran for exceptionally long times and the inclusion of the State and Region categorical variables made the number of permutations far too high to be useful. Our three approaches are similar in some ways and different in others as detailed below.
This approach involved attempting to manually select a smaller set of predictors based on their textual descriptions and their individual linear relationships to the response variable.
data_1 = read_csv("data/model_1_data.csv")
pairs(data_1)
Below are plots of the relationships between the response and several predictors.
Transforming predictor variables had no effect on model improvement, but performing a log transformation on the YPLLRate response variable did show improvement. The model was selected by a forward AIC selection with the scope being set to the full interactive model of the chosen predictors.
empty_model = lm(log(YPLLRate) ~ 1, data = data_1)
selected_model = step(empty_model, scope = log(YPLLRate) ~ HealthBadDaysMentAvg * HospPrevRate *
Income20thpcntl * InjuryDeathRate *
LBWPct * LifeExpect *
MammAnnualPct * SmokersPct *
UnemployedPct * UninsAdultPct *
VaccinatedPct,
direction = "forward", trace = 0)
influential_points = which(cooks.distance(selected_model) > 4 / length(cooks.distance(selected_model)))
outliers = as.vector(as.integer(names(rstandard(selected_model)[abs(rstandard(selected_model)) > 2])))
noise = c(outliers, influential_points)
new_data_1 = data_1[-noise, ]
new_empty_model = lm(log(YPLLRate) ~ 1, data = new_data_1)
final_model_1 = step(new_empty_model, scope = log(YPLLRate) ~ HealthBadDaysMentAvg * HospPrevRate *
Income20thpcntl * InjuryDeathRate *
LBWPct * LifeExpect *
MammAnnualPct * SmokersPct *
UnemployedPct *
UninsAdultPct * VaccinatedPct,
direction = "forward", trace = 0)
In this method, we started by selecting a small number of predictors that showed a strong connection to the response variable “Years of Potential Life Lost Rate.” This selection was done based on the study of the variables from the data source https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation
This set of variables were added to smaller file to make the loading and knitting of the markdown document more efficient.
model_2_DF = read_csv("data//model_2_data.csv")
model_2_DF$`WaterViol` = as.factor(model_2_DF$`WaterViol`)
model_2_DF = model_2_DF[sample(1:nrow(model_2_DF), 2000, replace=FALSE),]
# Split into test and train
indexes = sample(nrow(model_2_DF), 1000)
train_df = model_2_DF[indexes,]
test_df = model_2_DF[-indexes,]
# FULL model, with all predictors
model_2_full = lm(YPLLRate ~ . , data = model_2_DF)
# Check for Outliers and Influence.
outliers = as.vector(as.integer(names(rstandard(model_2_full)[abs(rstandard(model_2_full)) > 2])))
high_influence = as.vector(which(cooks.distance(model_2_full) > 4 / length(cooks.distance(model_2_full))))
# Remove outliers
remove_noise = c(outliers, high_influence)
model_2_DF_clean = model_2_DF[-remove_noise,]
model_2_full_cl = lm(YPLLRate ~ . , data = model_2_DF_clean)
(model_2_aic = step(model_2_full_cl, direction = "backward", trace=0))
##
## Call:
## lm(formula = YPLLRate ~ LBWPct + ObeseAdultsPct + DrinkExcPct +
## HSGradRate + DistressFreqPhysPct + HIVRate + IncHousehldMedian +
## HomeownersPct, data = model_2_DF_clean)
##
## Coefficients:
## (Intercept) LBWPct ObeseAdultsPct
## 2009.30060 290.81267 40.03548
## DrinkExcPct HSGradRate DistressFreqPhysPct
## -116.70675 14.61183 303.14913
## HIVRate IncHousehldMedian HomeownersPct
## 0.35770 -0.04601 30.04804
We tried several approaches to improve this model but the best model we could achieve was by applying box-cox transform to the response variable of a model selected via backward-AIC. model
boxcox(model_2_aic, lambda = seq(-0.25, 0.75, by = 0.05), plotit = TRUE)
model_2_cox = lm(formula = (((YPLLRate ^ 0.6) - 1) / 0.6) ~ LBWPct + ObeseAdultsPct + DrinkExcPct + HSGradRate + DistressFreqPhysPct + HIVRate + IncHousehldMedian + HomeownersPct, data = model_2_DF_clean)
Plots
We started with a script that merged two sheets from the original data. The team chose to keep variables that contained rates and to drop race demographics. We started with 92 variables and 3141 observations. For this approach, we then dropped columns with too many NA’s (more than our result YPLLRate) and then omitted observations containing ‘NA’. This left us with 71 variables and 2333 observations. Lastly, several variables were coerced to factors and short names were added.
adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. It indicates the percent of the variation in the output variables that are explained by the input variables.
LOOCV RMSE: An observation is removed and the model is fit the the remaining data and this fit used to predict the value of the deleted observation. This is repeated, n times, for each of the n observations and the mean square error is computed.
Standard Deviation: We divided LOOCV RMSE by standard deviation in order to normalize RMSE. This allowed us to compare RMSE between models, including models with transformed results. All three models applied a different transformation to their result.
Correlation Coefficients were used to assess how highly correlated predictors are with the result. They were also used to consider which predictors may be interacting.
Partial Correlation Coefficients The partial correlation coefficient was used in predictor selection in addition to correlation because it it controls for confounding variables in a model.
Shapiro-Wilk Test: Shapiro test assesses the normality of errors. The null hypothesis assumes data were sampled from a normal distribution. A small p-value rejects the null hypothesis.
Breusch-Pagan-Godfrey Test: The null hypothesis for the bptest is constant variance (homoscedasticity). As always if the p-value is below a threshold (\(\alpha = 0.05\)) then heteroscedasticity is assumed. Heteroscedasticity occurs when the variance of the error terms differ across observations. The BP test statistic shows how big the problem is. Smaller is better.
Result selection: We chose YPLLRate as our response variable because it represented the overall data and it had the most highly correlated predictors. YPLL is the years of potential life lost before age 75 per 100,000 population (age-adjusted). YPLLRate is a measure of premature death and is used on an interactive site https://www.countyhealthrankings.org/explore-health-rankings to allow people to compare their State and County to others. The statement is on their home page is interesting… “The annual Rankings provide a revealing snapshot of how health is influenced by where we live, learn, work, and play. They provide a starting point for change in communities.” . Data is gathered from a variety of sources to create a model for comparison between States and Counties (https://www.countyhealthrankings.org/explore-health-rankings/measures-data-sources/2020-measures). Several other variables were considered to be the result but YPLLRate was chosen as the best representative of the data.
Predictor selection:
Problems with Model_3 and methods to improve it: + LOOCV RMSE (Cross-Validated RMSE) was too high (355) - Outliers were removed and improved LOOCV RMSE by 30%. - Transforming a few predictors with log and higher order terms improved the adjusted-rsquared and LOOCV RMSE. - Adding some two-way interactions improved the adjusted r-squared and LOOCV RMSE slightly. - BoxCox recommended a lambda of 0.8. Transformation of response \(YPLLRate^\lambda\) produced the best results (rather than \(Y^\lambda\) rather than \(\frac{(Y^\lambda - 1)}\lambda\)). It improved LOOCV RMSE 50% and improved the normality problems with response YPLLRate. The LOOCV RMSE was well below the standard deviation of YPLLRate. The large tails on the Q-Q plot shrank to almost nothing.
Model 3 has a good adjusted R-squared value of 0.972. LOOCV RMSE is 17.5 and when normalized with the standard deviation the normalized RMSE is 0.16. This is a pretty good value. When the data was split into test and train and run 1000 times, the normalized RMSE was 0.17, confirming our LOOCV RMSE result. The transformation with BoxCox lambda greatly improved the Q-Q plot and the tails were minimal. The Shapiro test had a high P-value which support the null hypothesis of normality of errors. The remaining problem is it fails the BPtest with BP value of 131 and p-value very low. The model still suffers from heteroscedasticity but does well on the other measures. BP value was much improved from over 300 before the BoxCox transform and removal of influential observations. Model 3 uses 23 predictors out of 93 possible and they are representative of the major contributors to health (environment, income, healthcare, demographics). Given the high correlation between many of the predictors and the high variance of the response, this is a decent model.
Model_1
formula(model_3)
## YPLLRate^lmda_mdl3 ~ HealthPoorPct + FoodEnvirIX + PhysInactPct +
## PhysPrimCareRate + MammAnnualPct + Income20thpcntl + IncomeRatio +
## SocsAssRate + InjuryDeathRate + HousSvrProbPct + HousInadFacil +
## LifeExpect + DeathAgeAdjRate + VaccinatedPct + IncHousehldMedian +
## PopGE65Pct + PopNativePct + PopPacificPct + EnglishNotProfPct +
## PopRuralPct + PopGE65Pct:DeathAgeAdjRate + HealthPoorPct:PopFemalePct +
## LBWPct:IncHousehldMedian
Model_2
formula(model_3)
## YPLLRate^lmda_mdl3 ~ HealthPoorPct + FoodEnvirIX + PhysInactPct +
## PhysPrimCareRate + MammAnnualPct + Income20thpcntl + IncomeRatio +
## SocsAssRate + InjuryDeathRate + HousSvrProbPct + HousInadFacil +
## LifeExpect + DeathAgeAdjRate + VaccinatedPct + IncHousehldMedian +
## PopGE65Pct + PopNativePct + PopPacificPct + EnglishNotProfPct +
## PopRuralPct + PopGE65Pct:DeathAgeAdjRate + HealthPoorPct:PopFemalePct +
## LBWPct:IncHousehldMedian
Model_3
formula(model_3)
## YPLLRate^lmda_mdl3 ~ HealthPoorPct + FoodEnvirIX + PhysInactPct +
## PhysPrimCareRate + MammAnnualPct + Income20thpcntl + IncomeRatio +
## SocsAssRate + InjuryDeathRate + HousSvrProbPct + HousInadFacil +
## LifeExpect + DeathAgeAdjRate + VaccinatedPct + IncHousehldMedian +
## PopGE65Pct + PopNativePct + PopPacificPct + EnglishNotProfPct +
## PopRuralPct + PopGE65Pct:DeathAgeAdjRate + HealthPoorPct:PopFemalePct +
## LBWPct:IncHousehldMedian
Residuals Plot
Q-Q Plot
# Model 1
info_1 = c(get_bp(final_model_1), get_sw(final_model_1), get_num_params(final_model_1), get_loocv_rmse(final_model_1), get_adj_r2(final_model_1), get_rmse(final_model_1), get_AIC(final_model_1))
# Model 2
info_2 = c(get_bp(model_2_cox), get_sw(model_2_cox), get_num_params(model_2_cox), get_loocv_rmse(model_2_cox), get_adj_r2(model_2_cox), get_rmse(model_2_cox), get_AIC(model_2_cox))
# Model 3
info_3 = c(get_bp(model_3), get_sw(model_3), get_num_params(model_3), get_loocv_rmse(model_3), get_adj_r2(model_3), get_rmse(model_3), get_AIC(model_3))
model_info = cbind(info_1, info_2, info_3)
rownames(model_info) = c("BP Test", "Shapiro","No params", "LOOV RMSE", "Adj R2", "RMSE", "AIC")
colnames(model_info) = c("Model 1", "Model 2", "Model 3")
knitr::kable(model_info) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Model 1 | Model 2 | Model 3 | |
|---|---|---|---|
| BP Test | 0.000000e+00 | 5.950000e-05 | 0.000000e+00 |
| Shapiro | 7.459690e-02 | 1.221250e-02 | 3.389923e-01 |
| No params | 2.500000e+01 | 9.000000e+00 | 2.400000e+01 |
| LOOV RMSE | 5.348390e-02 | 2.712907e+01 | 1.750069e+01 |
| Adj R2 | 9.617625e-01 | 7.942199e-01 | 9.716955e-01 |
| RMSE | 5.288420e-02 | 2.700378e+01 | 1.730247e+01 |
| AIC | -1.251407e+04 | 1.228563e+04 | 1.239788e+04 |